We will analyze global trends in terrorism and answer questions such as: Which countries suffer most from terrorism? Has terrorist activity increased globally? How many victims do terrorist attacks claim each year? We will then try to predict terrorist activity using machine learning techniques. We will visualize terrorist attacks globally and in the US, analyze global and local attack data, and predict terrorist activity in the US. Our data is obtained from the Global Terrorism Database (GTD) and covers terrorist attacks from 1970 to 2018. The main tools used are Pandas data frames, Folium for visualization, and scikit-learn and statsmodels for regression modeling.
We hope this work is insightful and enables future efforts to accurately predict and prevent terrorist attacks.
import folium
from folium.plugins import MarkerCluster
import requests
import pandas
import numpy
import re
import html5lib
from datetime import datetime
from bs4 import BeautifulSoup
import json
import matplotlib.pyplot as plot
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
To obtain the data on terrorist attacks, we registered an account on the Global Terrorism Database (GTD) website, which gave us access to data on terrorist attacks between 1970 and 2018. After downloading the data, we used Microsoft Excel to convert it into CSV. Then we load the data into a Pandas DataFrame in Python.
The following code loads the data into a Pandas Dataframe and prints the first few data points.
Download Data Here:
https://www.start.umd.edu/gtd/
input_file = "globalterrorismdb_0919csv.csv"
global_terrorism_data = pandas.read_csv(input_file)
global_terrorism_data.head(5)
The following code prints out the size of our data.
global_terrorism_data.shape
Our data is composed of 191464 rows and 135 columns. However, there are many missing values, and several columns will not be used in our analysis. We can start by selecting only the necessary columns from the dataset and then removing missing values where it makes sense to do so. For example, we will remove any row with a missing year or location.
columns_needed = ["iyear", "imonth", "iday", "country", "country_txt", "region", "region_txt", "provstate",
"city", "latitude", "longitude", "targtype1", "targtype1_txt", "attacktype1",
"attacktype1_txt", "success", "nkill", "nwound", "gname", "claimed", "weaptype1", "weaptype1_txt"]
# Using Pandas dataframe to create new dataframe from our previous one.
global_terrorism_data_needed = pandas.DataFrame(global_terrorism_data, columns = columns_needed)
# Drop rows with any missing year and location in the selected columns
global_terrorism_data_needed = global_terrorism_data_needed.dropna(how='any', subset=['latitude', 'longitude', "iyear"])
# Let's look at the size of our data now
global_terrorism_data_needed.shape
# Let's take a look at a few data points
global_terrorism_data_needed.head()
We first need to look at the number of attacks every year and see if there is any trend. Then we can look at the number of attacks by continent/country. To do this, we simply group our data by the variable we need and plot it.
# Using Pandas dataframe feature to count the number of attack by year
global_terrorism_year_counts = global_terrorism_data_needed.groupby('iyear').size().reset_index(name='attack_counts')
global_terrorism_year_counts.shape
global_terrorism_year_counts.head()
Let's plot the result to take a better look.
plot.figure(figsize=(15,6))
plot.title('Counts of terrorist attacks from 1970 to 2018')
plot.xlabel('Year')
plot.ylabel('attack counts')
plot.scatter("iyear", "attack_counts", data = global_terrorism_year_counts);
There is no single obvious trend. According to the plot, terrorist attacks were on the rise until 1992, declined through 2004, then increased sharply from 2010 onward, reaching a peak in 2014, followed by a steady decrease from 2015 to 2018. This pattern could be the result of several global factors that are not present in our data.
Next, we look at the number of attacks per region.
# group by region
global_terrorism_region_counts = global_terrorism_data.groupby(['region', 'region_txt']).size().reset_index(name='attack_counts')
global_terrorism_region_counts
# the plot of attacks by region
plot.figure(figsize=(15,6))
plot.title('Terrorist attack by regions from 1970 to 2018')
plot.xlabel('region')
plot.ylabel('attack_counts')
plot.scatter("region", "attack_counts", data = global_terrorism_region_counts);
Most attacks are concentrated in South Asia and the Middle East and North Africa. The Australasia and Oceania region has the lowest number of attacks. It will be interesting to see how target regions changed over time. Let's show that with a histogram.
plot_region = global_terrorism_data['iyear'].hist(by=global_terrorism_data['region_txt'], figsize=(20,20),
                                                  ylabelsize=20, xlabelsize=20, edgecolor='black', linewidth=1.2,
                                                  bins=24)
The highest number of attacks in North America occurred between 1970 and 1972. During that time, there were almost no attacks in other regions besides Western Europe. The regions with the most attacks between 2010 and 2018 were barely under attack then. Compared to the Middle East and North Africa, Sub-Saharan Africa, South Asia, Southeast Asia, or even Western Europe, the number of attacks in North America is quite low. Now let's look at the distribution by country.
# number of attacks by country
global_terrorism_country_counts = global_terrorism_data.groupby('country_txt').size().reset_index(name='attack_counts')
global_terrorism_country_counts
global_terrorism_country_counts.agg([max, min])
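The `agg([max, min])` call above reports the extreme counts but not which countries they belong to. A sketch of recovering the country names with `nlargest`/`nsmallest`, using a small hypothetical table with the same columns as `global_terrorism_country_counts` (the counts below are made up for illustration, not GTD values):

```python
import pandas

# Hypothetical mini-dataset with the same columns as
# global_terrorism_country_counts above (counts are invented).
counts = pandas.DataFrame({
    'country_txt': ['Iraq', 'Pakistan', 'Afghanistan', 'Fiji'],
    'attack_counts': [25000, 14000, 12000, 5],
})

# Rows with the most and fewest recorded attacks, names included.
top = counts.nlargest(3, 'attack_counts')
bottom = counts.nsmallest(1, 'attack_counts')
print(top['country_txt'].tolist())     # countries with the most attacks
print(bottom['country_txt'].tolist())  # country with the fewest attacks
```

Applied to the real `global_terrorism_country_counts`, the same two calls would name the most- and least-attacked countries directly.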
Next, we look at the types of attacks and their targets.
global_terrorism_attacktype_counts = global_terrorism_data.groupby('attacktype1_txt').size().reset_index(name='attack_counts')
global_terrorism_attacktype_counts
global_terrorism_targettype_counts = global_terrorism_data.groupby('targtype1_txt').size().reset_index(name='attack_counts')
global_terrorism_targettype_counts
We can better visualize these results by plotting the number of occurrences of each attack type per year on a histogram.
plot_attack = global_terrorism_data['iyear'].hist(by=global_terrorism_data['attacktype1_txt'], figsize =(20,20),
ylabelsize = 20, xlabelsize = 20, edgecolor='black', linewidth=1.2,
bins = 24)
Globally, all attack types have increased, with most of the growth between 2014 and 2018. Bombing/Explosion attacks show the largest increase.
Let's show the attacks in each region on a map. We will use Folium to generate the map.
Folium is a data visualization library in Python built primarily to help visualize geospatial data. With Folium, we can create a map of any location in the world as long as its latitude and longitude values are known. Also, the maps created by Folium are interactive, so we can zoom in and out after the map is rendered, which is a very useful feature.
Since we are working with geospatial maps, we need the country coordinates for plotting. Download the file from https://github.com/python-visualization/folium/blob/master/examples/data/world-countries.json
# Plot of the first 1000 attacks in 1970, 1990, 2000, 2014, and 2018.
global_terrorism_1970 = global_terrorism_data_needed.groupby('iyear').get_group(1970)
global_terrorism_1990 = global_terrorism_data_needed.groupby('iyear').get_group(1990)
global_terrorism_2000 = global_terrorism_data_needed.groupby('iyear').get_group(2000)
global_terrorism_2014 = global_terrorism_data_needed.groupby('iyear').get_group(2014)
global_terrorism_2018 = global_terrorism_data_needed.groupby('iyear').get_group(2018)
# Initialize the global map using folium
latitude = 0.0
longitude = 50
global_map_1970 = folium.Map(location=[latitude, longitude], zoom_start=1.5)
cluster = MarkerCluster().add_to(global_map_1970)
# Showing the first 1000 data points
size = 1000
for each in global_terrorism_1970[0:size].iterrows():
    folium.Marker([each[1]['latitude'], each[1]['longitude']],
                  popup=('City: ' + str(each[1]['city']).capitalize() + '<br>'
                         'Target: ' + str(each[1]['targtype1_txt']) + '<br>'
                         'Attack Type: ' + str(each[1]['attacktype1_txt']).capitalize() + '<br>')
                  ).add_to(cluster)
global_map_1970
# Initialize the global map using folium
latitude = 0.0
longitude = 50
global_map_1990 = folium.Map(location=[latitude, longitude], zoom_start=1.5)
cluster = MarkerCluster().add_to(global_map_1990)
# Showing the first 1000 data points
size = 1000
for each in global_terrorism_1990[0:size].iterrows():
    folium.Marker([each[1]['latitude'], each[1]['longitude']],
                  popup=('City: ' + str(each[1]['city']).capitalize() + '<br>'
                         'Target: ' + str(each[1]['targtype1_txt']) + '<br>'
                         'Attack Type: ' + str(each[1]['attacktype1_txt']).capitalize() + '<br>')
                  ).add_to(cluster)
global_map_1990
# Initialize the global map using folium
latitude = 0.0
longitude = 50
global_map_2000 = folium.Map(location=[latitude, longitude], zoom_start=1.5)
cluster = MarkerCluster().add_to(global_map_2000)
# Showing the first 1000 data points
size = 1000
for each in global_terrorism_2000[0:size].iterrows():
    folium.Marker([each[1]['latitude'], each[1]['longitude']],
                  popup=('City: ' + str(each[1]['city']).capitalize() + '<br>'
                         'Target: ' + str(each[1]['targtype1_txt']) + '<br>'
                         'Attack Type: ' + str(each[1]['attacktype1_txt']).capitalize() + '<br>')
                  ).add_to(cluster)
global_map_2000
# Initialize the global map using folium
latitude = 0.0
longitude = 50
global_map_2014 = folium.Map(location=[latitude, longitude], zoom_start=1.5)
cluster = MarkerCluster().add_to(global_map_2014)
# Showing the first 1000 data points
size = 1000
for each in global_terrorism_2014[0:size].iterrows():
    folium.Marker([each[1]['latitude'], each[1]['longitude']],
                  popup=('City: ' + str(each[1]['city']).capitalize() + '<br>'
                         'Target: ' + str(each[1]['targtype1_txt']) + '<br>'
                         'Attack Type: ' + str(each[1]['attacktype1_txt']).capitalize() + '<br>')
                  ).add_to(cluster)
global_map_2014
# Initialize the global map using folium
latitude = 0.0
longitude = 50
global_map_2018 = folium.Map(location=[latitude, longitude], zoom_start=1.5)
cluster = MarkerCluster().add_to(global_map_2018)
# Showing the first 1000 data points
size = 1000
for each in global_terrorism_2018[0:size].iterrows():
    folium.Marker([each[1]['latitude'], each[1]['longitude']],
                  popup=('City: ' + str(each[1]['city']).capitalize() + '<br>'
                         'Target: ' + str(each[1]['targtype1_txt']) + '<br>'
                         'Attack Type: ' + str(each[1]['attacktype1_txt']).capitalize() + '<br>')
                  ).add_to(cluster)
global_map_2018
Let us explore the possibility of a relationship between the number of attacks and the year variable. From the plot of counts of terrorist attacks from 1970 to 2018, it is clear that a linear relationship is not expected, but other variables such as region might help describe the relationship. First we fit a linear model to our data. The LinearRegression module from sklearn is very useful here.
X = global_terrorism_year_counts.iyear.values
Y = global_terrorism_year_counts.attack_counts.values
X = X.reshape(-1, 1)
Y = Y.reshape(-1, 1)
reg = LinearRegression().fit(X, Y)
reg.score(X, Y)
# reg.coef_  # slope of the fitted line
We can look at the plot to see that this is not a good regression for our data.
plot.figure(figsize=(15,6))
plot.scatter(X, Y, color='black')
plot.plot(X, reg.predict(X), color='blue', linewidth=3)
plot.show()
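Since the scatter is clearly non-linear, one option is a polynomial fit in the year using the `PolynomialFeatures` transformer imported earlier. A minimal sketch with hypothetical data (a constructed quadratic series standing in for `global_terrorism_year_counts`, so the fit is exact by design):

```python
import numpy
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical year/count data: an exact quadratic trend, so a
# degree-2 polynomial regression should recover it perfectly.
X = numpy.arange(1970, 2019).reshape(-1, 1).astype(float)
Y = 0.5 * (X - 1970) ** 2

# Expand each year x into the features [1, x, x^2], then fit a
# plain linear model on the expanded features.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
reg_poly = LinearRegression().fit(X_poly, Y)
print(reg_poly.score(X_poly, Y))  # R^2 is ~1.0 on this exact quadratic
```

On the real `global_terrorism_year_counts` data the same transform-then-fit pattern applies, though the R-squared would of course be far below 1.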
Using statsmodels api, we can get more info on the regression.
models = sm.OLS(Y,X)
results = models.fit()
print(results.summary())
First, the summary shows the dependent variable, the model, and the method. OLS stands for Ordinary Least Squares, and the method "Least Squares" means that we are fitting a regression line that minimizes the sum of squared distances from the line. The coefficient of 1.9619 means that as the year increases by 1, the predicted number of attacks increases by 1.9619. A few other important values are the R-squared, the percentage of variance our model explains, which is very low here, suggesting a bad fit; and the standard error, the standard deviation of the coefficient's sampling distribution, which is large, again indicating a bad fit.
We could add the region variable and check for improvement. However, the region data has several missing values. We can build another dataset by grouping by year and region.
Also, further analysis would focus on the US and examine the relationship between the number of attacks, the attack type, the location of attacks, and more.